
Conversation

@stiepan
Member

@stiepan stiepan commented Nov 21, 2025

Description

  1. Adds ccx.utils.StridedLayout for describing an ndim-dimensional tensor layout (shape, strides in element counts, itemsize).
    • Creating a layout directly from shape/strides:
      StridedLayout(shape, strides, itemsize), StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)
    • Creating a dense layout with a specific stride order (C|F or a permutation): StridedLayout.dense(shape, itemsize, stride_order)
    • Creating a dense layout from another one: StridedLayout.dense_like and self.to_dense
    • Implements the properties: ndim, shape, volume, strides, strides_in_bytes, stride_order, min_offset, max_offset, is_contiguous(c|f|any), is_unique
    • required_size_in_bytes method for computing the required allocation size
    • Stride manipulation helpers for reshaping, (un)squeezing, permuting, flattening, repacking (changing the itemsize, as in viewing a float tensor as a complex one), broadcasting, and slicing.

From Python, StridedLayout is immutable; stride manipulation methods return a new instance. In Cython, to avoid temporary objects in a sequence of operations, layout manipulation methods can be run in place.
Please take a look at the StridedLayout docs for more details and examples.
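A minimal sketch of the API above (the stride_order spelling and the tuple-typed properties are assumptions based on this description, not verified against the final docs):

    import numpy as np
    import cuda.core.experimental as ccx  # ``ccx`` alias as used in this description

    a = np.zeros((2, 3), dtype=np.float32)
    # NumPy reports strides in bytes; divide_strides=True converts them to counts
    layout = ccx.utils.StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)
    assert layout.strides == (3, 1)            # strides in element counts
    assert layout.strides_in_bytes == (12, 4)  # strides in bytes

    # a dense F-ordered layout with the same shape and itemsize
    f_layout = ccx.utils.StridedLayout.dense(a.shape, a.itemsize, stride_order="F")
    assert f_layout.required_size_in_bytes() == a.nbytes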

  2. Enables wrapping an external allocation into Buffer (Buffer.from_handle(ptr, owner=obj)); see the sketch after this list. The owner and a memory resource cannot be specified together. The owner reference is kept until the Buffer is closed. Without a memory resource, Buffer now queries the driver for the host/device accessibility and device_id of the ptr.

  3. StridedMemoryView now uses StridedLayout to represent the shape/strides.

    • For DLPack/CAI imported tensors, the layout is lazily created if needed.
    • There's a new class method from_buffer(buffer, layout, optional dtype) to create an SMV from a Buffer and a StridedLayout. For example, to implement an empty_like() for a numpy array, but allocated on a device, one could:
    def device_tensor_like(a: numpy.ndarray, device: ccx.Device) -> StridedMemoryView:
        a_view = StridedMemoryView(a, -1)
        # get the original layout of ``a`` and convert it to a dense layout
        # to avoid overallocating memory (e.g. if ``a`` was sliced)
        layout = a_view.layout.to_dense()
        # get the required size in bytes to fit the tensor
        required_size = layout.required_size_in_bytes()
        # allocate the memory on the device
        device.set_current()
        mem = device.allocate(required_size)
        # create a view on the newly allocated device memory
        b_view = StridedMemoryView.from_buffer(mem, layout, a_view.dtype)
        return b_view
    
    • The StridedMemoryView can now be exported via dlpack (moved out to a later PR).
    • StridedMemoryView.copy_from and StridedMemoryView.copy_to will allow copying data between views (in a follow-up PR).
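A sketch of item 2, wrapping an external allocation (the positional size argument follows the Buffer.from_handle(ptr, size, owner=...) form quoted later in this PR; host memory is used for brevity, and the import location is assumed):

    import numpy as np
    from cuda.core.experimental import Buffer  # assumed import location

    arr = np.arange(16, dtype=np.uint8)
    # owner= keeps ``arr`` alive; owner and a memory resource are mutually exclusive
    buf = Buffer.from_handle(arr.ctypes.data, arr.nbytes, owner=arr)
    # with no memory resource given, the driver is queried for the pointer's
    # host/device accessibility and device_id
    buf.close()  # drops the reference to ``arr``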

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


args_viewable_as_strided_memory

:template: dataclass.rst
Member Author
@stiepan stiepan Nov 21, 2025

  • dataclass.rst does not render methods.
  • class.rst omits cythonized properties.

cyclass places the attributes section just after the main class docstring. This way we can document the actual attributes at the end of the main docstring, and they are followed by the docstrings of all the properties.

@stiepan stiepan changed the title Introduce strided layout memview feat: Introduce StridedLayout, support wrapping external allocations in Buffer, support creating StridedMemoryView from Buffer and StridedLayout, export SMV via dlpack. Nov 21, 2025
@stiepan stiepan changed the title feat: Introduce StridedLayout, support wrapping external allocations in Buffer, support creating StridedMemoryView from Buffer and StridedLayout, export SMV via dlpack. feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer, export SMV via dlpack. Nov 21, 2025
@leofang leofang added triage Needs the team's attention feature New feature or request cuda.core Everything related to the cuda.core module labels Nov 24, 2025
Member
@leofang leofang left a comment

Checking in EOD progress. I haven't reviewed layout/memoryview.

Also, I assume you're working on migrating the tests?

dl_tensor = &dlm_tensor_ver.dl_tensor
dlm_tensor_ver.version.major = DLPACK_MAJOR_VERSION
dlm_tensor_ver.version.minor = DLPACK_MINOR_VERSION
cpython.Py_INCREF(buf)
Member

We need to call decref in the except: branch.

Member Author

The cleanup helpers do so. I moved the cpython.Py_INCREF call next to the assignment of its address to manager_ctx, for clearer "release" pointer semantics: if manager_ctx is set, the DECREF needs to be called.
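A Python analogue of that invariant (illustrative only; the actual code is Cython using cpython.Py_INCREF directly): take the reference at the exact point it is handed to manager_ctx, so the error path has a single condition to check.

    import ctypes

    class _Ctx:
        manager_ctx = None  # owning reference: set => cleanup must decref

    def fill_capsule_ctx(ctx: _Ctx, buf):
        try:
            ...  # populate tensor metadata first; any of this may raise
            ctypes.pythonapi.Py_IncRef(ctypes.py_object(buf))
            ctx.manager_ctx = buf  # from here on, cleanup must decref
        except Exception:
            if ctx.manager_ctx is not None:
                ctypes.pythonapi.Py_DecRef(ctypes.py_object(ctx.manager_ctx))
                ctx.manager_ctx = None
            raise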

driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
)
return driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
Member

TODO: cimport this from cydriver

Member Author

🙃

I actually had it that way initially, but seeing that all the cydriver imports were gone from buffer, I went along with the Python API. I can dig up the previous variant.

Member

Yeah, sorry. I think it's not "gone" gone; most likely @Andy-Jost found that we don't need many driver API calls in this file after the refactoring (#1205). But pointer attribute checks are in the hot path, so we should cythonize them.

In fact, I am trying to catch up with what @fbusato is doing in C++ (NVIDIA/cccl#6733), which is an equivalent check (but for C++ mdspan instead of Python SMV).

Member Author
@stiepan stiepan Nov 27, 2025

Thanks for the reference! Looking at the logic in cccl, I adjusted the managed memory "discovery". I'm not sure we need to go into as much detail as fetching the particular memory pool and checking whether its readability flag is set; I did not add this, but can adjust if needed.

In any case, I moved back to the cydriver API and added tests covering host/device/managed/pinned memory from CUDA malloc and pinned memory from CUDA host register.

For pinned memory, I'm not 100% sure what happens on devices for which CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is false, i.e. when one registers host memory with cuMemHostRegister and passes the original pointer, while the pointer used for device access is different. Would it still be memory_type = 0 or memory_type = host, and what would be the desired is_device_accessible value?

#include <numeric>


#define STRIDED_LAYOUT_MAX_NDIM 32
Member

Q: is there any particular reason (apart from using 32-bit masks) we want to limit this to 32? There are libraries that use a higher limit, for example cuTENSOR (@yangcal would remember what the current limit is).

Member Author

Initially, precompiling the copy kernels was on the table, so sticking to 32 dims was more appealing. Now, I don't think there's anything preventing us from going to 64 dims, especially if there's a real need for it.


@stiepan stiepan force-pushed the introduce_strided_layout_memview branch from c238c1a to dc27268 on November 26, 2025 18:06
@stiepan stiepan marked this pull request as ready for review November 26, 2025 18:16
@stiepan stiepan changed the title feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer, export SMV via dlpack. feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer Nov 26, 2025
@stiepan
Member Author

stiepan commented Nov 27, 2025

The DLPack fix moved out of this PR is here: #1291

Member Author

@leofang,

NB.

from cuda.bindings import driver
driver.cuPointerGetAttributes(1, [driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL], 0)

returns

(<CUresult.CUDA_SUCCESS: 0>, [4294967294])
while the actual value returned by the underlying API is -2. Looks like an unsigned/signed type mismatch.
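(For reference, the reported value is consistent with a signed 32-bit value read back as unsigned:)

    assert (-2) & 0xFFFFFFFF == 4294967294  # 2**32 - 2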

@leofang
Member

leofang commented Nov 30, 2025

Let's kick off CI

@leofang
Member

leofang commented Nov 30, 2025

/ok to test 3a904e7


@stiepan
Member Author

stiepan commented Dec 1, 2025

A single test case failed - testing pointer attributes for host memory "manually pinned" with cuMemHostRegister. It failed on a precondition assert that the memory is not device accessible before it is registered, and it failed in the second of two cases testing this. The test did not clean up properly - it was missing the unregister call; I am guessing we ended up with the same pointer in the second case. 38ddb36 should fix that.
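A sketch of the missing pairing (hypothetical test shape; assumes an initialized CUDA context, and that cuda.bindings driver calls return a tuple led by the CUresult):

    import ctypes
    from cuda.bindings import driver

    nbytes = 4096
    host_mem = (ctypes.c_byte * nbytes)()
    ptr = ctypes.addressof(host_mem)
    (err,) = driver.cuMemHostRegister(ptr, nbytes, 0)
    assert err == driver.CUresult.CUDA_SUCCESS
    try:
        ...  # pointer-attribute assertions go here
    finally:
        # without this, a later test may observe the same, still-pinned pointer
        (err,) = driver.cuMemHostUnregister(ptr)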

@stiepan stiepan force-pushed the introduce_strided_layout_memview branch from 38ddb36 to 26dfe3b on December 1, 2025 14:49
Comment on lines +601 to +603
cdef int itemsize = nbits >> 3
if (itemsize << 3) != nbits:
raise ValueError("dl_tensor.dtype.bits must be a multiple of 8")
Collaborator

How should we handle things like 6-bit and 4-bit floats? https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L171-L181

Member Author
@stiepan stiepan Dec 1, 2025

Good question; I'm not sure, really. AFAIK we don't support it now anyway - the dtype parsing won't accept it.

Playing with https://github.com/jax-ml/ml_dtypes I see that numpy:

  1. does not allow exporting such types through dlpack
  2. does not pack those - one element corresponds to one byte offset.

We could work around it in different ways; just looking at the landscape now, I'm not sure what the desired behavior is. E.g. would a dlpack dtype for sub-byte floats have nbits * lanes divisible by 8, or is the numpy behavior a common thing, meaning we should round the itemsize up to 1?

Collaborator

cc @seberg for visibility and any thoughts

Given that JAX and ml_dtypes use unpacked layouts, and we generally want to support packed layouts, that sounds like a bit of an impedance mismatch.

@seberg

NumPy's problem with packed layouts is that the array object's ABI is all byte-strided/pointer based. There might be paths to address this.1
One could create e.g. a 2xfloat4_e2m1fn with explicit functions to pack and unpack. I'm not sure a dtype helps a lot if you need to unpack for some or even most operations (compared to packing/unpacking with a custom converter function).

Enabling ml_dtypes to import/export for DLPack in NumPy, we can do, I think. Export would be easy; even a __dlpack_dtype__ would work. Unfortunately, a proper import implementation needs some form of registration, so it's probably not very quick.

FWIW, DLPack can allow bits == 8 to represent the padded versions. I don't think there was a particular reason not to just spell that out.
But by itself that doesn't actually quite help us (not without NumPy allowing export, or some hacks here to make it work).

Footnotes

  1. I'll think about bit-sized dtypes a bit more. Even without a new type, it may be plausible to assume nobody really accesses an array whose dtype they don't know (i.e. a bit-sized dtype) -- i.e. it's a bit of an ABI break, but likely doesn't really matter.
    A normal 64-bit pointer probably has enough room to include a bit offset (or element size?), so things may actually work with surprisingly few tweaks (I assume 32-bit systems may not have enough space in practice to store any offset in the pointer, but that seems OK).
    (Unless something about passing char */void * breaks this.)

Member Author
@stiepan stiepan Dec 2, 2025

Thanks @seberg for the detailed overview.

So when receiving an fp4/fp6 tensor through dlpack we can expect 3 cases:

  1. The "padded/unpacked" tensor (what numpy + ml_dtypes does now). We can represent these with no changes here, as soon as we start accepting the dtype. In dlpack, setting bits to 8 for those seems to be a bit of a grey area, but if it's there, the StridedLayout won't be a hindrance. It looks like copying those does not require anything special.

  2. The packed, vectorized dtype (2xfloat4_e2m1fn, 4xfloat6_e2m3fn?). We just need to take lanes into account, i.e. itemsize = nbits * lanes // 8; assert itemsize * 8 == nbits * lanes. I omitted the lanes because we forbid lanes > 1 when creating the numpy.dtype, but this is a simple adjustment; see the sketch after this list. Again, it looks like copying those does not require anything special.

  3. Receiving the packed, non-vectorized dtype.
    A low-effort workaround would be to enforce scenario 2. I guess the actual tensors must have nice strides and a round number of elements to be performant, so this should be possible in any real-life case. We could either require users to pass views like in scenario 2, or stipulate that we do such a conversion and repack the layout automatically. If really necessary, we could special-case the SMV to keep the original shape and strides and use the repacked StridedLayout only internally if a copy is requested.
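A sketch of the lanes-aware itemsize computation from scenario 2 (hypothetical helper; the field names follow DLPack's DLDataType):

    def itemsize_from_dl_dtype(bits: int, lanes: int) -> int:
        nbits = bits * lanes
        if nbits % 8 != 0:
            raise ValueError("dl_tensor.dtype: bits * lanes must be a multiple of 8")
        return nbits // 8

    assert itemsize_from_dl_dtype(4, 2) == 1   # 2xfloat4_e2m1fn packs into one byte
    assert itemsize_from_dl_dtype(32, 1) == 4  # plain float32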

Comment on lines +618 to +637
cdef bint is_allocated = buffer.owner is None
if is_allocated:
if buffer.memory_resource is None:
raise ValueError(
"Ambiguous buffer instance. The buffer must either hold an allocation "
"(coming from MemoryResource, e.g. Device().memory_resource.allocate()) "
"or wrap external data and specify the owner "
"(`Buffer.from_handle(ptr, size, owner=...)`)."
)
# For external buffers, we may not know the size. Even if we did, the size
# alone is not enough if the layout can map to negative offsets, i.e.:
# the valid range is not the [ptr, ptr + size - 1], but
# [ptr - offset, ptr + size - offset - 1]. The offset is not reported
# by the packages.
if is_allocated and buffer.size < layout.get_required_size_in_bytes():
raise ValueError(
f"Buffer size is too small for the layout. "
f"Expected at least {layout.get_required_size_in_bytes()} bytes, "
f"got {buffer.size} bytes."
)
Collaborator

These checks seem somewhat expensive to do on the path of getting the ptr to the data. Should we validate these only at construction time and avoid them in a hot path like this?

Member Author

Those checks now happen for new SMVs created with the from_buffer or view methods. We could limit them to from_buffer only or get rid of them completely. I don't have a strong preference either way.

Collaborator

My 2c:

  • If Buffer isn't validating ownership, neither should SMV. If we wanted to require an owner we should do so at the Buffer level. Let's remove the is_allocated check entirely?
  • Given we're allowing effectively slicing a buffer at SMV construction time using offset and size, I think the layout size validation makes sense at from_buffer and view construction time.

with cython.gil:
# Device class handles the cuInit call internally
from cuda.core.experimental import Device
Device()
Contributor

Can we call cuInit if that's what we're after, instead of calling Device?
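For reference, the direct call would look like this (in cuda.bindings, driver calls return the CUresult as the first tuple element):

    from cuda.bindings import driver

    (err,) = driver.cuInit(0)
    assert err == driver.CUresult.CUDA_SUCCESS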

F = "F"


class _S:
Contributor

We're already importing numpy where this is being used, can't we use np.s_ for this instead of duplicating that functionality solely for testing?

Member Author

Cool, I wasn't aware of this utility.
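For reference, np.s_ turns indexing syntax into the corresponding slice objects:

    import numpy as np

    assert np.s_[1:5, ::2] == (slice(1, 5, None), slice(None, None, 2))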

